Interpreting Sequence-Levenshtein distance for determining error type and frequency between two embedded sequences of equal length

Levenshtein distance is a commonly used edit distance metric, typically applied in language processing, and to a lesser extent, in molecular biology analysis. Biological nucleic acid sequences are often embedded in longer sequences and are subject to insertion and deletion errors that introduce frameshift during sequencing. These frameshift errors are due to string context and should not be counted as true biological errors. Sequence-Levenshtein distance is a modification to Levenshtein distance that is permissive of frameshift error without additional penalty. However, in a biological context Levenshtein distance needs to accommodate both frameshift and weighted errors, which Sequence-Levenshtein distance cannot do. Errors are weighted when they are associated with a numerical cost that corresponds to their frequency of appearance. Here, we describe a modification that allows the use of Levenshtein distance and Sequence-Levenshtein distance to appropriately accommodate penalty-free frameshift between embedded sequences and correctly weight specific error types.


Introduction
Levenshtein distance (LD) is a widely used edit distance metric [1]. The LD algorithm identifies the number of insertions, deletions and substitutions needed to convert one sequence to another. LD can also assign weights to each error type. A weighted error has a numerical cost that is inversely associated with its appearance frequency. Common applications of LD include natural language processing tasks such as speech recognition, dialect detection, plagiarism exposure and spell checking [2] [3] [4] [5]. LD operates under the assumption of fixed sequence length, with the analytical window frame for computing distance including the full length of the two sequences. The Sequence-Levenshtein distance (SLD) modification to the LD algorithm allows comparison of embedded sequences, which experience error-induced frameshift, without any additional distance cost [6]. However, SLD cannot accommodate weighted errors. The ability to allow for frameshift without penalty as well as weighted errors has direct relevance in many molecular biology applications. For example, DNA sequencing platforms can introduce a characteristic error profile into the nucleic acid sequence such that certain error types occur more frequently than others. In these cases, the more frequently occurring error types should carry a smaller error weight so that they are reasonably accommodated in sequence comparison [7].
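To make the notion of a weighted error concrete, the following is a minimal Python sketch of LD with a separate cost per error type. The function name `weighted_levenshtein` and its parameters are illustrative assumptions rather than part of any published implementation, and the example costs are arbitrary.

```python
def weighted_levenshtein(source, target, ins_cost=1.0, del_cost=1.0, sub_cost=1.0):
    """Edit distance from `source` to `target` with one cost per error type.

    Illustrative sketch: the per-type costs are hypothetical inputs, not
    values taken from any sequencing platform.
    """
    # Row 0: building each prefix of `target` from an empty string (insertions only).
    prev = [j * ins_cost for j in range(len(target) + 1)]
    for i, s in enumerate(source, start=1):
        # Column 0: removing the first i characters of `source` (deletions only).
        curr = [i * del_cost]
        for j, t in enumerate(target, start=1):
            curr.append(min(
                prev[j] + del_cost,                          # delete s
                curr[j - 1] + ins_cost,                      # insert t
                prev[j - 1] + (0.0 if s == t else sub_cost)  # match or substitute
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(weighted_levenshtein("kitten", "sitting"))                # unit costs
    print(weighted_levenshtein("kitten", "sitting", sub_cost=2.0))  # costlier substitutions
```

With unit costs this reduces to ordinary LD. Lowering the cost of a frequent error type makes alignments that use it cheaper, which is the behaviour the weighting is meant to provide.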
Combining the benefits of weighted errors and frameshift accommodation can be accomplished by interpreting the location and value of the lowest value along the last column and last row in the completed unweighted LD table. In other words, by interpreting the SLD position and value on the completed LD matrix, the error types and frequency of appearance between the sequences can be determined. This strategy allows error-specific weights to be added while also accommodating error-induced frameshift without additional penalty. The following mathematical conjectures describe the relationship between error type and frequency and the location of the lowest value along the last column and last row of the LD matrix. Examples are given for each case, illustrating how this information can be used for interpreting the error profile between strings, which can then be used to incorporate weighted LD with frameshift correction allowance into the sequence analysis.
Let (a_i, b_j) be the entry in the i-th row and the j-th column of an LD matrix table created by comparing two strings of the same length, n. Then (a_n, b_n) will be the lowest right-hand entry in the table, positioned at the last row and last column. When an entry in an embedded string of interest is deleted or inserted, there is a frameshift to the left or to the right, while the window of analysis (the frame) remains the same since the lengths of the two strings are uniform. In the case of a deletion, any elements beyond the string of interest that constitute its context will then fill in the empty space. In the case of an insertion, elements may be pushed outside the analytical window upstream or downstream. The position of each base in the sequence will be denoted as S_i, with S_n the last base of the sequence. An LD matrix table is provided in Figure 1 as a reference for analyzing the changes in the following cases. The green highlighted cells that appear in the case matrices represent the SLD positions, which are the cells to be interpreted.
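As a minimal sketch of the table lookup described above, the following Python computes the unweighted LD table and reports the lowest value found along the last row and last column, together with every border cell that holds it. The function names `ld_matrix` and `sld_positions` are illustrative. Placing the first string on the rows and the second string on the columns is an assumption about table orientation.

```python
def ld_matrix(source, target):
    """Unweighted Levenshtein table; rows follow `source`, columns follow `target`.

    The orientation (which string is on the rows) is an assumption of this sketch.
    """
    rows, cols = len(source) + 1, len(target) + 1
    table = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        table[i][0] = i  # i deletions turn source[:i] into the empty string
    for j in range(cols):
        table[0][j] = j  # j insertions turn the empty string into target[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if source[i - 1] == target[j - 1] else 1
            table[i][j] = min(
                table[i - 1][j] + 1,        # deletion
                table[i][j - 1] + 1,        # insertion
                table[i - 1][j - 1] + cost  # match or substitution
            )
    return table


def sld_positions(source, target):
    """Return (lowest value, sorted (row, column) cells holding it) on the border.

    The border is the last row and the last column of the LD table.
    """
    table = ld_matrix(source, target)
    n, m = len(source), len(target)
    border = {(i, m): table[i][m] for i in range(n + 1)}        # last column
    border.update({(n, j): table[n][j] for j in range(m + 1)})  # last row
    lowest = min(border.values())
    return lowest, sorted(cell for cell, value in border.items() if value == lowest)


if __name__ == "__main__":
    print(sld_positions("TAGCTAGC", "TAGTAGCT"))
```

Under this orientation, `sld_positions("TAGCTAGC", "TAGTAGCT")` returns `(1, [(8, 7)])`: a value of 1 in the last row, one column to the left of the corner.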

Case 1
Given k insertions, l deletions, any number of substitutions, no error at entry S_n regardless of S_n frameshift, and no insertion(s) between S_{n-1} and S_n, the entry with the lowest value in column n and row n will be the following:

Case 1 example
To change "TAGCTAGC" to "TAGTAGCT", the operations include a deletion of "C" and the addition of "T" on the 3' end due to frameshift. The computed LD matrix between these words, with the SLD placement and value highlighted in green, is interpreted to reveal a single deletion, as shown in Figure 2.

Case 2 example
To change "TAGCTAGC" to "TATAGCTA", the operations can include either a deletion of "GC" and a frameshift towards the 5' end that allows "TA" to enter the frame, effectively making S_n and S_{n-1} substituted, or an insertion of "TA" that pushes "GC" out of the frame on the 3' end, also making S_n and S_{n-1} substituted. The computed LD matrix between these words, with the SLD placement and value highlighted in green, is interpreted to reveal either two deletions or two insertions, as shown in Figure 3.

Case 3 example
To change "TAGCTAGC" to "ATAAGCTG", the operations can include either two insertions of "A" and one deletion of "A", two insertions of "A" and a substitution of "A" to "G", or three insertions of "A". The computed LD matrix between these words, with the SLD placement and value highlighted in green, is interpreted to reveal either two insertions and one deletion, two insertions and one substitution, or three insertions, as shown in Figure 4.

Case 4
Given k = 1 insertion between S_{n-1} and S_n, no error at S_n or S_{n-1}, and any offsetting error(s) elsewhere such that there is no downstream or upstream frameshift of S_n, the entries that share the lowest values in column n and row n will be (a_n, b_n), (a_n, b_{n-1}) and (a_n, b_{n-2}).

Case 4 example
To change "TAGCTAGC" to "TAGCAGTC", the operations can include either a deletion of "T" and an insertion of "T", a deletion of "T" and a substitution of "C" to "T", or a deletion of "T" and "C". The computed LD matrix between these words, with the SLD placement and value highlighted in green, is interpreted to reveal either an insertion and a deletion, a deletion and a substitution, or two deletions, as shown in Figure 5.

Case 5
Given l = 1 deletion of S_{n-1}, no error at S_n or S_{n-2}, no insertion between S_{n-2} and S_{n-3}, and any error(s) elsewhere such that there is no frameshift of S_n, the entries that share the lowest values in column n and row n will be

Case 5 example
To change "TAGCTAGC" to "TAGTCTAC", the operations can include either an insertion of "T" and a deletion of "G", an insertion of "T" and a substitution of "G" to "C", or an insertion of a "T" and a "C". The computed LD matrix between these words, with the SLD placement and value highlighted in green, is interpreted to reveal either an insertion and a deletion, an insertion and a substitution, or two insertions, as shown in Figure 6.
Case 6
Given l = 1 deletion of S_{n-1} and a substitution error at S_n such that the substitution does not equal the original S_{n-1}, and any error(s) elsewhere such that there is no frameshift of S_n and the sequences do not match, the entries that share the lowest values in column n and row n will be (a_{n-1}, b_n) and (a_{n-2}, b_n). If S_n experiences upstream frameshift under these conditions, the lowest values will follow the rules in conjecture 2. If S_n experiences downstream frameshift under these conditions, the lowest values will follow the rules in conjecture 1.

Case 6 example
To change "TAGCTAGC" to "TAGTCTAT", the operations can include either an insertion of "T" and a substitution of "G" to "T", or an insertion of a "T" and another "T". The computed LD matrix between these words, with the SLD placement and value highlighted in green, is interpreted to reveal either an insertion and a substitution or two insertions, as shown in Figure 7.
Case 7
Given l > 1 consecutive deletion errors starting at S_{n-1} and any error(s) elsewhere, except consecutive offsetting insertions between S_{n-(l+1)} and S_{n-(l+2)}, such that there is no frameshift of S_n, the entries that share the lowest values in column n and row n will be (a_{n-l}, b_n) and (a_{n-(l+1)}, b_n).

Case 7 example
To change "TAGCTAGC" to "ATAAGACC", the minimum number of operations can include insertions of three "A" letters and a "C", or insertions of three "A" letters and a substitution of "T" to "C". The computed LD matrix between these words, with the SLD placement and value highlighted in green, is interpreted to reveal either an insertion and a substitution or two insertions, as shown in Figure 8.

Discussion
The position and value of SLD along the last column and last row of the LD matrix can reveal the error types and appearance frequencies between two sequences of the same length. Insertions move the SLD from the corner up the i border (the last column), whereas deletions move the SLD from the corner to the left along the j border (the last row). An insertion matched with a deletion does not move the SLD position along the matrix, and neither do substitutions. Interestingly, an insertion and a deletion pair can produce the same result as a substitution in a sequence if they occur in the same place. However, an insertion and a deletion pair are two errors as opposed to a single substitution error, so distinguishing between these two options matters when counting errors. If an SLD occurs in the corner of the i and j borders and the value is 2 or more, it could represent one or more insertion-deletion pairs, all substitutions, or a combination of both between sequences. If an SLD occurs in the corner of the i and j borders and the value is 1, it represents a single substitution error. Furthermore, some values of 2 or more in the corner of the i and j borders cannot be substitutions, but rather represent only insertion and deletion pairs, as demonstrated in the examples for conjectures 4 and 5.
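The movement rules in this paragraph can be sketched as a small lookup on the SLD cell. The function `interpret_sld_position` and its return format are hypothetical, and the sketch covers only the unmatched ("net") insertions or deletions implied by the offset from the corner. It does not resolve substitutions or matched insertion-deletion pairs, which the SLD value and the individual cases address.

```python
def interpret_sld_position(i, j, n):
    """Read the offset of an SLD cell (row i, column j) from the corner (n, n).

    Hypothetical helper: moving up the last column is read as unmatched
    insertions, moving left along the last row as unmatched deletions.
    """
    if i == n and j == n:
        # Corner: only substitutions and/or matched insertion-deletion pairs.
        return {"net_insertions": 0, "net_deletions": 0}
    if j == n and 0 <= i < n:
        # Last column, above the corner.
        return {"net_insertions": n - i, "net_deletions": 0}
    if i == n and 0 <= j < n:
        # Last row, left of the corner.
        return {"net_insertions": 0, "net_deletions": n - j}
    raise ValueError("cell is not on the last row or last column of the table")


if __name__ == "__main__":
    # Cell in the last row, one column left of the corner, for strings of length 8.
    print(interpret_sld_position(8, 7, 8))
```

When several border cells share the lowest value, each cell can be passed through the same lookup to list the competing interpretations.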
Therefore, if there is ambiguity when interpreting SLD for error type and frequency at the corner of the i and j borders, it can be resolved by applying any known probability of whether a substitution is expected to occur more, less or the same as an insertion-deletion pair. Similarly, conjecture 2 describes a scenario where either all insertions or an equal number of deletions can make one sequence match the other. In order to determine which of the two error types is likely responsible for the sequence change in any given analysis, the expected appearance relationship between them for that specific analysis needs to be known. An example of using known probabilities to guide decision making is relying on a specific DNA sequencer's error hallmark.
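One way such known probabilities could be applied is sketched below. The probabilities are invented placeholders, not measurements from any sequencer. Treating errors as independent and using the negative log of each probability as its weight are assumptions of this sketch, chosen so that more frequent errors receive smaller costs. The helper names are hypothetical.

```python
import math


def error_weights(probabilities):
    """Convert per-error probabilities into costs; frequent errors cost less.

    Using -log(p) is an assumption of this sketch, not a prescribed weighting.
    """
    return {error: -math.log(p) for error, p in probabilities.items()}


def most_likely_interpretation(candidates, probabilities):
    """Pick the candidate error profile with the lowest total weight.

    Each candidate maps an error type to the number of times it occurs.
    Errors are assumed to be independent of one another.
    """
    weights = error_weights(probabilities)

    def total_cost(candidate):
        return sum(weights[error] * count for error, count in candidate.items())

    return min(candidates, key=total_cost)


if __name__ == "__main__":
    # Placeholder error profile, for illustration only.
    profile = {"substitution": 0.01, "insertion": 0.001, "deletion": 0.002}
    # The three readings given for the Case 4 example.
    readings = [
        {"insertion": 1, "deletion": 1},
        {"deletion": 1, "substitution": 1},
        {"deletion": 2},
    ]
    print(most_likely_interpretation(readings, profile))
```

With this placeholder profile the deletion-plus-substitution reading is selected, because substitutions are assumed to be the most frequent error type. A different profile could favour a different reading.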
We wish to thank Boris Yazlovitsky, Greg Shomo, Mariana Levi and the rest of the Research Computing team at Northeastern University for their support.

Case 2
Given d consecutive deletion-induced, insertion-induced or bona fide substitution(s) that start at S_n and accumulate upstream, with no downstream frameshift of the substitutions, regardless of error(s) elsewhere, and provided an equal number of insertions or deletions can account for the changes between the sequences, the entries that share the lowest values in column n and row n will be in the positions (a_n, b_{n-y}) and (a_{n-y}, b_n) for all y ∈ {0, 1, ..., d}.

Case 3
Given d > 1 consecutive deletion-induced, insertion-induced or bona fide substitution(s) that start at S_n and accumulate upstream, and p insertion-induced downstream frameshifts of the substitutions such that at least one substitution remains at S_n, regardless of error(s) elsewhere, the entries that share the lowest values in column n and row n will be in positions (a_{n-p}, b_n) and (a_{n-(p+k)}, b_n) for k ∈ {1, 2, ..., y}, where y is the number of downstream substitutions left in the analytical window. If frameshift-inducing deletions or insertions neighbor the substitutions, behavior will match case 2.